To perform online iNMF, we need to install the online branch. Please see the instruction below.
library(devtools)
install_github("MacoskoLab/liger", ref = "online")
We first create a liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here.
library(liger)
pbmcs = createLiger(list(stim = "stim_PBMCs.h5", ctrl = "ctrl_PBMCs.h5"))
We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.
pbmcs = normalize(pbmcs)
pbmcs = selectGenes(pbmcs, var.thresh = 0.2, do.plot = F)
pbmcs = scaleNotCenter(pbmcs)
Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000).
pbmcs = online_iNMF(pbmcs, k = 20, max.epochs = 5)
After performing the factorization, we can perform quantile normalization to align the datasets.
pbmcs = quantile_norm(pbmcs)
We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.
pbmcs = runUMAP(pbmcs)
plotByDatasetAndCluster(pbmcs, axis.labels = c("UMAP1","UMAP2"))
We can also compare clusters or datasets (within each cluster) to identify differentially expressed genes. The runWilcoxon function performs differential expression analysis by sampling a specified number of cells from the dataset on disk, then performing an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory.
de_genes = runWilcoxon(pbmcs, compare.method = "clusters", max.sample = 5000)
head(de_genes)
## feature group avgExpr logFC statistic auc pval
## 1 AL627309.1 1 -23.02585 -0.006801316 746300.5 0.4997864 0.7123956
## 2 RP11-206L10.2 1 -23.02585 -0.003373260 746460.0 0.4998932 0.7946840
## 3 LINC00115 1 -23.02585 -0.089421497 742313.0 0.4971160 0.1738807
## 4 NOC2L 1 -21.44847 0.502855594 770801.5 0.5161943 0.0294685
## 5 KLHL17 1 -22.97600 0.033192611 748163.0 0.5010337 0.3023024
## 6 PLEKHN1 1 -23.02585 -0.071678476 743110.5 0.4976501 0.2198825
## padj pct_in pct_out
## 1 0.8712779 100 100
## 2 0.8985167 100 100
## 3 0.4964537 100 100
## 4 0.1521384 100 100
## 5 0.6598760 100 100
## 6 0.5574626 100 100
We can also perform online iNMF with continually arriving datasets.
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp = selectGenes(MOp, var.thresh = 2)
MOp.vargenes = MOp@var.genes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp2 = createLiger(list(nuclei = "allen_smarter_nuclei.h5"))
MOp2 = normalize(MOp2)
MOp2@var.genes = MOp@var.genes
MOp2 = scaleNotCenter(MOp2)
MOp = online_iNMF(MOp, X_new = list(nuclei = "allen_smarter_nuclei.h5"), k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp@var.genes = MOp.vargenes
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp = online_iNMF(MOp, X_new = list(nuclei = "allen_smarter_nuclei.h5"), k = 40, project = TRUE)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))